Session 7: Scraping Static Web Pages

Introduction to Web Scraping and Data Management for Social Scientists

Johannes B. Gruber

2024-07-30

Introduction

This Course

Day Session
1 Introduction
2 Data Structures and Wrangling
3 Working with Files
4 Linking and joining data & SQL
5 Scaling, Reporting and Database Software
6 Introduction to the Web
7 Static Web Pages
8 Application Programming Interface (APIs)
9 Interactive Web Pages
10 Building a Reproducible Research Project

The Plan for Today

In this session, we trap some docile data that wants to be found. We will:

  • Go over some parsing examples:
    • Wikipedia: World Happiness Report
  • Discuss some examples of good approaches to data wrangling
  • Go into a bit more detail on requesting raw data

Image credits: prowebscraper.com; Joe Caione via unsplash.com

Example: World Happiness Report

Use your Browser to Scout

Use your Browser’s Inspect tool

Note: Might not be available on all browsers; use a Chromium-based browser or Firefox.

Use rvest to scrape

library(rvest)
library(tidyverse)

# 1. Request & collect raw html
html <- read_html("https://en.wikipedia.org/w/index.php?title=World_Happiness_Report&oldid=1165407285")

# 2. Parse
happy_table <- html |> 
  html_elements(".wikitable") |> # select the right element
  html_table() |>                # special function for tables
  pluck(3)                       # select the third table

# 3. No wrangling necessary
happy_table
# A tibble: 153 × 9
   `Overall rank` `Country or region` Score `GDP per capita` `Social support`
            <int> <chr>               <dbl>            <dbl>            <dbl>
 1              1 Finland              7.81             1.28             1.5 
 2              2 Denmark              7.65             1.33             1.50
 3              3 Switzerland          7.56             1.39             1.47
 4              4 Iceland              7.50             1.33             1.55
 5              5 Norway               7.49             1.42             1.50
 6              6 Netherlands          7.45             1.34             1.46
 7              7 Sweden               7.35             1.32             1.43
 8              8 New Zealand          7.3              1.24             1.49
 9              9 Austria              7.29             1.32             1.44
10             10 Luxembourg           7.24             1.54             1.39
# ℹ 143 more rows
# ℹ 4 more variables: `Healthy life expectancy` <dbl>,
#   `Freedom to make life choices` <dbl>, Generosity <dbl>,
#   `Perceptions of corruption` <dbl>
## Plot the relationship between wealth and life expectancy
ggplot(happy_table, aes(x = `GDP per capita`, y = `Healthy life expectancy`)) + 
  geom_point() + 
  geom_smooth(method = 'lm')

Exercises 1

  1. Get the table of 2023 opinion polling for the 2024 United Kingdom general election from https://en.wikipedia.org/wiki/Opinion_polling_for_the_2024_United_Kingdom_general_election
  2. Wrangle and plot the opinion poll data

Example: UK prime ministers on Wikipedia

Use your Browser to Scout

Use rvest to scrape

# 1. Request & collect raw html
html <- read_html("https://en.wikipedia.org/w/index.php?title=List_of_prime_ministers_of_the_United_Kingdom&oldid=1166167337") # I'm using an older revision of the page since the live version was recently changed

# 2. Parse
pm_table <- html |> 
  html_element(".wikitable:contains('List of prime ministers')") |>
  html_table() |> 
  as_tibble(.name_repair = "unique") |> 
  filter(!duplicated(`Prime ministerOffice(Lifespan)`))

# 3. Inspect the result
pm_table
# A tibble: 75 × 11
   Portrait...1 Portrait...2 Prime ministerOffice(Lifespa…¹ `Term of office...4`
   <chr>        <chr>        <chr>                          <chr>               
 1 "Portrait"   "Portrait"   Prime ministerOffice(Lifespan) start               
 2 "​"           ""           Robert Walpole[27]MP for King… 3 April1721         
 3 "​"           ""           Spencer Compton[28]1st Earl o… 16 February1742     
 4 "​"           ""           Henry Pelham[29]MP for Sussex… 27 August1743       
 5 "​"           ""           Thomas Pelham-Holles[30]1st D… 16 March1754        
 6 "​"           ""           William Cavendish[31]4th Duke… 16 November1756     
 7 "​"           ""           Thomas Pelham-Holles[32]1st D… 29 June1757         
 8 ""           ""           John Stuart[33]3rd Earl of Bu… 26 May1762          
 9 ""           ""           George Grenville[34]MP for Bu… 16 April1763        
10 ""           ""           Charles Watson-Wentworth[35]2… 13 July1765         
# ℹ 65 more rows
# ℹ abbreviated name: ¹​`Prime ministerOffice(Lifespan)`
# ℹ 7 more variables: `Term of office...5` <chr>, `Term of office...6` <chr>,
#   `Mandate[a]` <chr>, `Ministerial offices held as prime minister` <chr>,
#   Party <chr>, Government <chr>, MonarchReign <chr>
In the page source, each name cell of the table looks like this:

<td rowspan="4">
  <span class="anchor" id="18th_century"></span>
   <b>
     <a href="/wiki/Robert_Walpole" title="Robert Walpole">Robert Walpole</a>
   </b>
   <sup id="cite_ref-FOOTNOTEEccleshallWalker20021,_5EnglefieldSeatonWhite19951–5PrydeGreenwayPorterRoy199645–46_28-0" class="reference">
     <a href="#cite_note-FOOTNOTEEccleshallWalker20021,_5EnglefieldSeatonWhite19951–5PrydeGreenwayPorterRoy199645–46-28">[27]</a>
   </sup>
   <br>
   <span style="font-size:85%;">MP for <a href="/wiki/King%27s_Lynn_(UK_Parliament_constituency)" title="King's Lynn (UK Parliament constituency)">King's Lynn</a>
   <br>(1676–1745)
  </span>
</td>
links <- html |> 
  html_elements(".wikitable:contains('List of prime ministers') b a") |>
  html_attr("href")
title <- html |> 
  html_elements(".wikitable:contains('List of prime ministers') b a") |>
  html_text()
tibble(name = title, link = links)
# A tibble: 90 × 2
   name                 link                                             
   <chr>                <chr>                                            
 1 Robert Walpole       /wiki/Robert_Walpole                             
 2 George I             /wiki/George_I_of_Great_Britain                  
 3 George II            /wiki/George_II_of_Great_Britain                 
 4 Spencer Compton      /wiki/Spencer_Compton,_1st_Earl_of_Wilmington    
 5 Henry Pelham         /wiki/Henry_Pelham                               
 6 Thomas Pelham-Holles /wiki/Thomas_Pelham-Holles,_1st_Duke_of_Newcastle
 7 William Cavendish    /wiki/William_Cavendish,_4th_Duke_of_Devonshire  
 8 Thomas Pelham-Holles /wiki/Thomas_Pelham-Holles,_1st_Duke_of_Newcastle
 9 George III           /wiki/George_III                                 
10 John Stuart          /wiki/John_Stuart,_3rd_Earl_of_Bute              
# ℹ 80 more rows

Note: these are relative links that need to be combined with https://en.wikipedia.org/ to work
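One way to do this combination is xml2's url_absolute(), which resolves relative links against a base URL (a minimal sketch, assuming the links object from above):

```r
library(xml2)

# resolve the relative Wikipedia links against the site's base URL
absolute_links <- url_absolute(links, "https://en.wikipedia.org/")
```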

Exercises 2

  1. For extracting text, rvest has two functions: html_text and html_text2. Explain the difference. You can test your explanation with the example html below.
html <- "<p>This is some text
         some more text</p><p>A new paragraph!</p>
         <p>Quick Question, is web scraping:

         a) fun
         b) tedious
         c) I'm not sure yet!</p>" |> 
  read_html()
  2. How could you convert the links object so that it contains actual URLs?
  3. How could you add the links we extracted above to the pm_table to keep everything together?

Example: Amazon product reviews

Goal

  1. Collect all reviews from a given Amazon page
  2. Identify and extract relevant variables for each review

Scout

After clicking on reviews, then “See more reviews”, these are the links to the first three pages:

https://www.amazon.co.uk/Discovering-Statistics-Using-Andy-Field/product-reviews/1446200469/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews
https://www.amazon.co.uk/Discovering-Statistics-Using-Andy-Field/product-reviews/1446200469/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2
https://www.amazon.co.uk/Discovering-Statistics-Using-Andy-Field/product-reviews/1446200469/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=3

URLs explained

  • A Uniform Resource Locator (URL) is the key mechanism used to identify resources on a website for retrieval
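To see the parts a URL is made of, you can take one apart with httr2's url_parse() (a small sketch; the URL here is just an illustrative example):

```r
library(httr2)

# split a URL into scheme, hostname, path and query parameters
parts <- url_parse("https://www.amazon.co.uk/product-reviews/1446200469?ie=UTF8&pageNumber=2")
parts$hostname  # the server the resource lives on
parts$query     # a named list of query parameters, e.g. pageNumber
```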

Getting all URLs to product reviews

  • We can spot two things:
    • there are 185 reviews for this product
    • there are 10 reviews per page
  • Which means: there should be 19 pages
  • However: Amazon limits results to 10 pages per query
  • To get more results, you could filter by stars, change sorting, etc.
pages <- 1:10
links <- paste0(
  "https://www.amazon.co.uk/Discovering-Statistics-Using-Andy-Field/product-reviews/1446200469/ref=cm_cr_arp_d_paging_btm_next_", 
  pages, "?ie=UTF8&reviewerType=all_reviews&pageNumber=", 
  pages,
  "&sortBy=recent"
)
links
 [1] "https://www.amazon.co.uk/Discovering-Statistics-Using-Andy-Field/product-reviews/1446200469/ref=cm_cr_arp_d_paging_btm_next_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1&sortBy=recent"  
 [2] "https://www.amazon.co.uk/Discovering-Statistics-Using-Andy-Field/product-reviews/1446200469/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2&sortBy=recent"  
 [3] "https://www.amazon.co.uk/Discovering-Statistics-Using-Andy-Field/product-reviews/1446200469/ref=cm_cr_arp_d_paging_btm_next_3?ie=UTF8&reviewerType=all_reviews&pageNumber=3&sortBy=recent"  
 [4] "https://www.amazon.co.uk/Discovering-Statistics-Using-Andy-Field/product-reviews/1446200469/ref=cm_cr_arp_d_paging_btm_next_4?ie=UTF8&reviewerType=all_reviews&pageNumber=4&sortBy=recent"  
 [5] "https://www.amazon.co.uk/Discovering-Statistics-Using-Andy-Field/product-reviews/1446200469/ref=cm_cr_arp_d_paging_btm_next_5?ie=UTF8&reviewerType=all_reviews&pageNumber=5&sortBy=recent"  
 [6] "https://www.amazon.co.uk/Discovering-Statistics-Using-Andy-Field/product-reviews/1446200469/ref=cm_cr_arp_d_paging_btm_next_6?ie=UTF8&reviewerType=all_reviews&pageNumber=6&sortBy=recent"  
 [7] "https://www.amazon.co.uk/Discovering-Statistics-Using-Andy-Field/product-reviews/1446200469/ref=cm_cr_arp_d_paging_btm_next_7?ie=UTF8&reviewerType=all_reviews&pageNumber=7&sortBy=recent"  
 [8] "https://www.amazon.co.uk/Discovering-Statistics-Using-Andy-Field/product-reviews/1446200469/ref=cm_cr_arp_d_paging_btm_next_8?ie=UTF8&reviewerType=all_reviews&pageNumber=8&sortBy=recent"  
 [9] "https://www.amazon.co.uk/Discovering-Statistics-Using-Andy-Field/product-reviews/1446200469/ref=cm_cr_arp_d_paging_btm_next_9?ie=UTF8&reviewerType=all_reviews&pageNumber=9&sortBy=recent"  
[10] "https://www.amazon.co.uk/Discovering-Statistics-Using-Andy-Field/product-reviews/1446200469/ref=cm_cr_arp_d_paging_btm_next_10?ie=UTF8&reviewerType=all_reviews&pageNumber=10&sortBy=recent"

Collecting the reviews from 1 page

# make sure to read the HTML in only once
html <- read_html(links[1])

# we select review elements first
reviews <- html |> 
  # this is a trick I had to google, ^= means starts with
  html_elements("[id^=customer_review-]")

# then extract the information from these reviews
rating <- reviews |> 
  html_element(".review-rating") |> 
  html_text2()

title <- reviews |> 
  html_elements(".a-letter-space+span") |> 
  html_text2()

date <- reviews |> 
  html_elements(".review-date") |> 
  html_text2()

text <- reviews |> 
  html_elements(".review-text-content") |> 
  html_text2()

tibble(date, rating, title, text)
# A tibble: 10 × 4
   date                                               rating         title text 
   <chr>                                              <chr>          <chr> <chr>
 1 Reviewed in the United Kingdom on 11 October 2023  4.0 out of 5 … Grea… This…
 2 Reviewed in the United Kingdom on 27 October 2021  1.0 out of 5 … I do… Plot…
 3 Reviewed in the United Kingdom on 6 October 2021   1.0 out of 5 … Kind… The …
 4 Reviewed in the United Kingdom on 8 September 2020 5.0 out of 5 … Stat… Didn…
 5 Reviewed in the United Kingdom on 27 October 2019  4.0 out of 5 … Awes… Awes…
 6 Reviewed in the United Kingdom on 3 September 2019 5.0 out of 5 … Grea… Clea…
 7 Reviewed in the United Kingdom on 16 February 2019 5.0 out of 5 … a go… It i…
 8 Reviewed in the United Kingdom on 10 February 2019 5.0 out of 5 … Grea… High…
 9 Reviewed in the United Kingdom on 8 October 2018   5.0 out of 5 … Grea… Now …
10 Reviewed in the United Kingdom on 6 June 2018      5.0 out of 5 … Grea… Abso…

X people found this helpful: incomplete cases

helpful <- reviews |> 
  html_elements("[data-hook=\"helpful-vote-statement\"]") |> 
  html_text2()
helpful
[1] "One person found this helpful" "2 people found this helpful"  
[3] "One person found this helpful" "3 people found this helpful"  
[5] "One person found this helpful"
tibble(date, rating, helpful, title, text)
Error in `tibble()`:
! Tibble columns must have compatible sizes.
• Size 10: Existing data.
• Size 5: Column at position 3.
ℹ Only values of size one are recycled.

Iterate over cases, rather than variable values

What we did before extracts one specific value from each case. But we have no mechanism to deal with missing values! To solve that, we need to iterate over reviews rather than extracting title, text, date and rating from all elements at once.

parse_review <- function(r) {
  rating <- r |> 
    html_element(".review-rating") |> 
    html_text2()
  
  title <- r |> 
    html_elements(".a-letter-space+span") |> 
    html_text2()
  
  date <- r |> 
    html_elements(".review-date") |> 
    html_text2()
  
  text <- r |> 
    html_elements(".review-text-content") |> 
    html_text2()
  
  helpful <- r |> 
    html_elements("[data-hook=\"helpful-vote-statement\"]") |> 
    html_text2()
  
  if (length(helpful) == 0) {
    helpful <- NA_character_
  }
  
  tibble(date, rating, helpful, title, text)
}

map(reviews, parse_review) |> 
  bind_rows()
# A tibble: 10 × 5
   date                                               rating helpful title text 
   <chr>                                              <chr>  <chr>   <chr> <chr>
 1 Reviewed in the United Kingdom on 11 October 2023  4.0 o… <NA>    Grea… This…
 2 Reviewed in the United Kingdom on 27 October 2021  1.0 o… One pe… I do… Plot…
 3 Reviewed in the United Kingdom on 6 October 2021   1.0 o… 2 peop… Kind… The …
 4 Reviewed in the United Kingdom on 8 September 2020 5.0 o… <NA>    Stat… Didn…
 5 Reviewed in the United Kingdom on 27 October 2019  4.0 o… One pe… Awes… Awes…
 6 Reviewed in the United Kingdom on 3 September 2019 5.0 o… <NA>    Grea… Clea…
 7 Reviewed in the United Kingdom on 16 February 2019 5.0 o… <NA>    a go… It i…
 8 Reviewed in the United Kingdom on 10 February 2019 5.0 o… 3 peop… Grea… High…
 9 Reviewed in the United Kingdom on 8 October 2018   5.0 o… One pe… Grea… Now …
10 Reviewed in the United Kingdom on 6 June 2018      5.0 o… <NA>    Grea… Abso…

Iterate over pages to collect all reviews

get_reviews_from_page <- function(link) {
  html <- read_html(link)
  
  reviews <- html |> 
    html_elements("[id^=customer_review-]")
  
  map(reviews, parse_review) |> 
    bind_rows()
}

all_reviews <- map(links, get_reviews_from_page) |> 
  bind_rows()
glimpse(all_reviews)
Rows: 49
Columns: 5
$ date    <chr> "Reviewed in the United Kingdom on 11 October 2023", "Reviewed…
$ rating  <chr> "4.0 out of 5 stars", "1.0 out of 5 stars", "1.0 out of 5 star…
$ helpful <chr> NA, "One person found this helpful", "2 people found this help…
$ title   <chr> "Great content to support business studies degree", "I don't r…
$ text    <chr> "This book contains pretty much everything you need to know ab…
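The date and rating columns are still character; a possible wrangling step (a sketch, assuming all reviews follow the UK date format shown above):

```r
library(tidyverse)

all_reviews <- all_reviews |>
  mutate(
    # "4.0 out of 5 stars" -> 4; parse_number() extracts the first number
    rating = parse_number(rating),
    # "Reviewed in the United Kingdom on 11 October 2023" -> a Date
    date = str_remove(date, "Reviewed in the United Kingdom on ") |>
      lubridate::dmy()
  )
```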

Exercises 3

We might be interested in whether a purchase was verified or not.

  1. Extract that information from the first review page
  2. Add the variable verified to the parse_review function
  3. Create all_reviews again, but with the verified variable this time

Example: Getting content from embedded json

Goal

  1. Collect news articles from news.sky.com
  2. Get the text of an article, the headline, date, and author

Scout

Let’s look for the date information in the page source

A Wild JSON string Appears!

  • JavaScript Object Notation (JSON) is a way of storing complicated nested data in plain text (see session 3)
  • data is put into a character string that indicates the types of objects and the relations between them
  • R knows how to read JSON strings/files and can easily process them
library(jsonlite)
json_string <- list(x = 1:10, y = list(z = 1:10, a = LETTERS[1:10])) |> 
  toJSON()
json_string
{"x":[1,2,3,4,5,6,7,8,9,10],"y":{"z":[1,2,3,4,5,6,7,8,9,10],"a":["A","B","C","D","E","F","G","H","I","J"]}} 
fromJSON(json_string)
$x
 [1]  1  2  3  4  5  6  7  8  9 10

$y
$y$z
 [1]  1  2  3  4  5  6  7  8  9 10

$y$a
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
  • essentially we seem to get pre-packaged data here

Obtain the JSON string

# 1. Request & collect raw html
html <- read_html("https://news.sky.com/story/crowdstrike-company-that-caused-global-techno-meltdown-offers-partners-10-vouchers-to-say-sorry-and-they-dont-work-13184488")

# 2. Parse
json_string <- html |> 
  rvest::html_element("[type=\"application/ld+json\"]") |>
  rvest::html_text() 

# 3. wrangling (part 1)
data <- jsonlite::fromJSON(json_string)

Wrangling part 2

From here it is really straightforward to extract (most of) the relevant information:

# datetime
datetime <- pluck(data, "datePublished") |>
  lubridate::as_datetime()

# headline
headline <- pluck(data, "headline")

# author
author <- pluck(data, "name")

The only thing missing from this data is the article itself…

Getting the article

text <- html |>
  rvest::html_elements(".sdc-article-body p") |>
  rvest::html_text2() |>
  paste(collapse = "\n")
sky_article <- tibble(datetime, author, headline, text)
glimpse(sky_article)
Rows: 1
Columns: 3
$ datetime <dttm> 2024-07-24 20:33:00
$ headline <chr> "CrowdStrike: Company that caused global techno meltdown off…
$ text     <chr> "The firm behind the global IT outage that cost companies bil…
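For scraping several articles, the steps above can be bundled into one function (a sketch; parse_sky_article is a name I made up, and it assumes other articles carry the same JSON-LD block and .sdc-article-body structure):

```r
# hypothetical helper wrapping the request, JSON-LD parsing and
# text extraction from above into one reusable function
parse_sky_article <- function(url) {
  html <- rvest::read_html(url)

  data <- html |>
    rvest::html_element("[type=\"application/ld+json\"]") |>
    rvest::html_text() |>
    jsonlite::fromJSON()

  tibble::tibble(
    datetime = lubridate::as_datetime(purrr::pluck(data, "datePublished")),
    headline = purrr::pluck(data, "headline"),
    text = html |>
      rvest::html_elements(".sdc-article-body p") |>
      rvest::html_text2() |>
      paste(collapse = "\n")
  )
}
```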

Exercises 4

  1. Get the author, publication datetime, headline and text from this site: https://www.cnet.com/tech/services-and-software/facebook-hopes-to-normalize-idea-of-data-scraping-leaks-says-leaked-internal-memo/ (hint: it works in a very similar way, but you have to apply one extra data wrangling step)

Example: zeit.de

Special Requests: Behind Paywall

Let’s get this cool data journalism article.

html <- read_html("https://www.zeit.de/mobilitaet/2024-04/deutschlandticket-klimaschutz-oeffentliche-verkehrsmittel-autos-verkehrswende")
html |> 
  html_elements(".article-body p") |> 
  html_text2()
[1] "Ganz Deutschland fährt Bahn. So fühlte sich das im Sommer 2022 zumindest an, als das 9-Euro-Ticket für drei Monate für überfüllte Züge sorgte. Die Bundesregierung und viele Menschen zeigten sich begeistert: So leicht war es also, Bürgerinnen und Bürger für die umweltfreundlichen öffentlichen Verkehrsmittel zu begeistern, man muss nur ein günstiges Ticket für ganz Deutschland anbieten."
[2] "Aber als die Bundesregierung den Nachfolger vorstellte, waren viele enttäuscht. 49 Euro monatlich kostet das Deutschlandticket und ist nur im Abo erhältlich. Euphorisch war nur noch die Bundesregierung. Doch jetzt, ein Jahr nach dem Start, kann man sagen: zu Recht. Zumindest, was die Fahrgastzahlen angeht."                                                                                

🤔 Wait, that’s only the first two paragraphs!

Tip

Websites use cookies to remember users (including logged-in ones)

What are browser cookies

  • Small pieces of data stored on the user’s device by the web browser while browsing websites
  • Purpose:
    • Session Management: Maintain user sessions by storing login information and keeping users logged in as they navigate a website.
    • Personalization: Save user preferences, such as language settings or theme choices, to enhance user experience.
    • Tracking and Analytics: Track user behavior across websites for analytics and targeted advertising.
  • We can use them in scraping:
    • to get content from websites that require consent before giving access
    • to authenticate as a user with content access privileges
    • to access personalized content
    • to simulate real user behavior, reducing the chances of getting blocked by websites with anti-scraping measures
  • You can use browser extensions like “Get cookies.txt” for Chromium-based browsers or “cookies.txt” for Firefox to save your cookies to a file
  • Implications:
    • You need to keep cookies secure as they can authenticate others as you!

Special Requests: Behind Paywall Cookies!

library(cookiemonster)
add_cookies("cookies.txt")
library(httr2)
html <- request("https://www.zeit.de/mobilitaet/2024-04/deutschlandticket-klimaschutz-oeffentliche-verkehrsmittel-autos-verkehrswende") |> # start a request
  req_options(cookie = get_cookies("zeit.de", as = "string")) |> # add cookies to be sent with it
  req_perform() |> 
  resp_body_html() # extract html from response

text <- html |> 
  html_elements(".article-body p") |> 
  html_text2()
length(text)
[1] 14

Example: South African Parliament (a special case)

Goal

  1. Collect information about any financial interest of South Africa’s Members of Parliament

Download files for processing

dir.create("data/za", showWarnings = FALSE)
if (!file.exists("data/za/2018.pdf")) {
  # multi_download is a neat little function that parallelises file downloads
  curl::multi_download(
    urls = interest_pdfs$link, 
    destfiles = interest_pdfs$file_name
  )
}

Scraping data from PDFs?

  • Data inside a PDF is actually not such an uncommon case
  • Many institutions share PDFs with tables, images and lists of data
  • We can use some of our new pattern-finding skills to scrape data from these PDFs as well; formatting is the key, e.g. in a conference programme:
    • session names seem to be in a larger font and bold
    • paper titles are in italics
    • authors are either bold or plain font

Let’s investigate the PDF a little

library(pdftools)
interests_pdf <- pdf_data("data/za/2018.pdf", font_info = TRUE)
glimpse(interests_pdf[[2]])
Rows: 172
Columns: 8
$ width     <int> 78, 31, 24, 6, 31, 16, 26, 40, 42, 25, 6, 29, 6, 57, 52, 33,…
$ height    <int> 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, …
$ x         <int> 35, 115, 150, 41, 54, 87, 106, 134, 177, 54, 82, 90, 41, 54,…
$ y         <int> 65, 65, 65, 83, 83, 83, 83, 83, 83, 91, 91, 91, 109, 109, 10…
$ space     <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE…
$ text      <chr> "Abraham-Ntantiso,", "Phoebe", "(ANC)", "1.", "SHARES", "AND…
$ font_name <chr> "Arial-BoldMT", "Arial-BoldMT", "Arial-BoldMT", "Arial-BoldM…
$ font_size <dbl> 8.775, 8.775, 8.775, 7.500, 7.500, 7.500, 7.500, 7.500, 7.50…

We see here that:

  • each page is an element in a list
  • each word is in one row of the table
  • it contains the font_size and font_name
  • the position of each word on the page is given with x and y coordinates

Let’s investigate the PDF a little

Let’s investigate a few words we saw above:

# a politician name
interests_pdf[[2]] |> 
  filter(str_detect(text, "Abrahams,"))
# A tibble: 1 × 8
  width height     x     y space text      font_name    font_size
  <int>  <int> <int> <int> <lgl> <chr>     <chr>            <dbl>
1    45      9    35   473 TRUE  Abrahams, Arial-BoldMT      8.78
# an item header
interests_pdf[[2]] |> 
  filter(str_detect(text, "1"))
# A tibble: 7 × 8
  width height     x     y space text  font_name    font_size
  <int>  <int> <int> <int> <lgl> <chr> <chr>            <dbl>
1     6      8    41    83 TRUE  1.    Arial-BoldMT      7.50
2    10      8    39   350 TRUE  10.   Arial-BoldMT      7.50
3    10      8    40   376 TRUE  11.   Arial-BoldMT      7.50
4    10      8    39   403 TRUE  12.   Arial-BoldMT      7.50
5    10      8    39   429 TRUE  13.   Arial-BoldMT      7.50
6     6      8    41   492 TRUE  1.    Arial-BoldMT      7.50
7    10      8    39   749 TRUE  10.   Arial-BoldMT      7.50
# the word 'disclose'
interests_pdf[[2]] |> 
  filter(str_detect(text, "disclose"))
# A tibble: 19 × 8
   width height     x     y space text      font_name font_size
   <int>  <int> <int> <int> <lgl> <chr>     <chr>         <dbl>
 1    29      8    90    91 FALSE disclose. ArialMT        7.50
 2    29      8    90   118 FALSE disclose. ArialMT        7.50
 3    29      8    90   144 FALSE disclose. ArialMT        7.50
 4    29      8    90   170 FALSE disclose. ArialMT        7.50
 5    29      8    90   196 FALSE disclose. ArialMT        7.50
 6    29      8    90   268 FALSE disclose. ArialMT        7.50
 7    29      8    90   295 FALSE disclose. ArialMT        7.50
 8    29      8    90   358 FALSE disclose. ArialMT        7.50
 9    29      8    90   385 FALSE disclose. ArialMT        7.50
10    29      8    90   411 FALSE disclose. ArialMT        7.50
11    29      8    90   437 FALSE disclose. ArialMT        7.50
12    29      8    90   500 FALSE disclose. ArialMT        7.50
13    29      8    90   526 FALSE disclose. ArialMT        7.50
14    29      8    90   553 FALSE disclose. ArialMT        7.50
15    29      8    90   579 FALSE disclose. ArialMT        7.50
16    29      8    90   605 FALSE disclose. ArialMT        7.50
17    29      8    90   631 FALSE disclose. ArialMT        7.50
18    29      8    90   658 FALSE disclose. ArialMT        7.50
19    29      8    90   684 FALSE disclose. ArialMT        7.50
# a word inside a table
interests_pdf[[2]] |> 
  filter(str_detect(text, "Pringle"))
# A tibble: 1 × 8
  width height     x     y space text    font_name font_size
  <int>  <int> <int> <int> <lgl> <chr>   <chr>         <dbl>
1    23      8   188   721 TRUE  Pringle ArialMT        7.50
# a table header
interests_pdf[[2]] |> 
  filter(str_detect(text, "Description"))
# A tibble: 3 × 8
  width height     x     y space text        font_name    font_size
  <int>  <int> <int> <int> <lgl> <chr>       <chr>            <dbl>
1    41      8    56   223 FALSE Description Arial-BoldMT      7.50
2    41      8    56   322 FALSE Description Arial-BoldMT      7.50
3    41      8    56   711 FALSE Description Arial-BoldMT      7.50

Findings:

  • It looks like we can say relatively easily where a new politician entry starts based on the font
  • The item header has the same font name, but a different size
  • We can tell quite easily on which items there is nothing to disclose
  • The table column names are similar to item headers, but start at a different x location

Test extract info from one page

p1 <- interests_pdf[[2]]
# add whether politician name
p1 |> 
  mutate(is_name = font_name == "Arial-BoldMT" & 
           round(font_size, 3) == 8.775) |> 
  # add whether header
  mutate(is_header = font_name == "Arial-BoldMT" & 
           round(font_size, 1) == 7.5)
# A tibble: 172 × 10
   width height     x     y space text     font_name font_size is_name is_header
   <int>  <int> <int> <int> <lgl> <chr>    <chr>         <dbl> <lgl>   <lgl>    
 1    78      9    35    65 TRUE  Abraham… Arial-Bo…      8.78 TRUE    FALSE    
 2    31      9   115    65 TRUE  Phoebe   Arial-Bo…      8.78 TRUE    FALSE    
 3    24      9   150    65 FALSE (ANC)    Arial-Bo…      8.78 TRUE    FALSE    
 4     6      8    41    83 TRUE  1.       Arial-Bo…      7.50 FALSE   TRUE     
 5    31      8    54    83 TRUE  SHARES   Arial-Bo…      7.50 FALSE   TRUE     
 6    16      8    87    83 TRUE  AND      Arial-Bo…      7.50 FALSE   TRUE     
 7    26      8   106    83 TRUE  OTHER    Arial-Bo…      7.50 FALSE   TRUE     
 8    40      8   134    83 TRUE  FINANCI… Arial-Bo…      7.50 FALSE   TRUE     
 9    42      8   177    83 FALSE INTERES… Arial-Bo…      7.50 FALSE   TRUE     
10    25      8    54    91 TRUE  Nothing  ArialMT        7.50 FALSE   FALSE    
# ℹ 162 more rows

Wrangle into shape

p1_df <- p1 |> 
  # one line in the PDF is all on the same y position
  group_by(y) |> 
  # so we can summarise, i.e. make one row out of one line
  summarise(
    # we retain only the first value from font_name, font_size and x
    # since they are all the same anyway
    font_name = head(font_name, 1),
    font_size = head(font_size, 1),
    x = head(x, 1),
    # we use paste with collapse to get several character values into
    # one string per line
    text = paste(text, collapse = " "), 
    # dropping groups as we don't need them
    .groups = "drop"
  ) |> 
  # we check whether a line is a name
  mutate(is_name = font_name == "Arial-BoldMT" & 
           round(font_size, 3) == 8.775) |> 
  # and add a unique ID per person
  mutate(id = cumsum(is_name)) |> 
  # now we do the same per disclosure item
  mutate(is_header = font_name == "Arial-BoldMT" & 
           round(font_size, 1) == 7.5 &
           x < 50) |> 
  mutate(item_id = cumsum(is_header))

Wrangle into shape

p1_df_tidy <- p1_df |> 
  # we group by person
  group_by(id) |> 
  # and add a new variable with their name
  mutate(
    name = text[is_name],
  ) |> 
  ungroup() |> 
  # now we can remove the rows that contain the name in the text
  filter(!is_name) |> 
  # we do the same per item
  group_by(item_id) |> 
  mutate(
    item = text[is_header],
    content = paste(text[!is_header], collapse = "\n")
  ) |> 
  # this produces a lot of duplicates, which we can get rid of with distinct
  distinct(name, item, content)
glimpse(p1_df_tidy)
Rows: 23
Columns: 4
Groups: item_id [23]
$ item_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
$ name    <chr> "Abraham-Ntantiso, Phoebe (ANC)", "Abraham-Ntantiso, Phoebe (A…
$ item    <chr> "1. SHARES AND OTHER FINANCIAL INTERESTS", "2. REMUNERATED EMP…
$ content <chr> "Nothing to disclose.", "Nothing to disclose.", "Nothing to di…

Apply to whole PDF

First, we have to add the page number to each page's table:

for (p in seq_along(interests_pdf)) {
  interests_pdf[[p]]$page <- p
}
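The same can be done without a loop using purrr's imap(), which passes each list element together with its position (a sketch, equivalent to the loop above):

```r
library(purrr)
library(dplyr)

# for an unnamed list, imap() supplies each element's position as the
# second argument, which we store as the page number
interests_pdf <- imap(interests_pdf, \(d, p) mutate(d, page = p))
```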

Apply to whole PDF

I’m sorry for this long code 😅

interests_df <- bind_rows(interests_pdf) |> 
  # one line in the PDF is all on the same y position
  group_by(page, y) |> 
  # so we can summarise, i.e. make one row out of one line
  summarise(
    # we retain only the first value from font_name, font_size and x
    # since they are all the same anyway
    font_name = head(font_name, 1),
    font_size = head(font_size, 1),
    x = head(x, 1),
    # we use paste with collapse to get several character values into
    # one string per line
    text = paste(text, collapse = " "), 
    # dropping groups as we don't need them
    .groups = "drop"
  ) |> 
  # we check whether a line is a name
  mutate(is_name = font_name == "Arial-BoldMT" & 
           round(font_size, 3) == 8.775) |> 
  # and add a unique ID per person
  mutate(id = cumsum(is_name)) |> 
  # everything before the first name can be removed
  filter(id != 0) |> 
  # now we do the same per disclosure item
  mutate(is_header = font_name == "Arial-BoldMT" & 
           round(font_size, 1) == 7.5 &
           x < 50) |> 
  mutate(item_id = cumsum(is_header))  |> 
  # we group by person
  group_by(id) |> 
  # and add a new variable with their name
  mutate(
    name = text[is_name],
  ) |> 
  ungroup() |> 
  # now we can remove the rows that contain the name in the text
  filter(!is_name) |> 
  # we do the same per item
  group_by(item_id) |> 
  mutate(
    item = text[is_header],
    content = paste(text[!is_header], collapse = "\n")
  ) |> 
  # this produces a lot of duplicates, which we can get rid of with distinct
  ungroup() |> 
  distinct(id, name, item, content)

Exercises 5

  1. In the folder /data (relative to this document) there is a PDF with some text. Read it into R
  2. The PDF has two columns; parse the left column of the first page into one object and the right column into another
  3. Now combine them so that the text is in the order a human would read it
  4. Let's assume you wanted this text in a table with one column indicating the section and one containing the text of the section
  5. Now let's assume you wanted to parse this at the paragraph level instead (hint: remember str_split_1)

Optional Homework

You have seen some tools and tricks to scrape websites now. But your best ally in web scraping is experience! Until tomorrow noon, your task is to find a page on Wikipedia that you find interesting and scrape content from there. Even if you don't fully succeed, document the steps you take and note down where the information can be found. If you want to try to get some data you actually need from a different website, you're also welcome to do that. But note that if you collect raw html in R and the data is not where it should be (e.g., the html elements containing the data do not exist), you might have discovered a more advanced site, which we will cover later. Note that down and try another page.

Deadline: Friday before class

Wrap Up

Save some information about the session for reproducibility.

Show Session Info
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.12.0 
LAPACK: /usr/lib/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pdftools_3.4.0      httr2_1.0.1         cookiemonster_0.0.3
 [4] jsonlite_1.8.8      rvest_1.0.4         lubridate_1.9.3    
 [7] forcats_1.0.0       stringr_1.5.1       dplyr_1.1.4        
[10] purrr_1.0.2         readr_2.1.5         tidyr_1.3.1        
[13] tibble_3.2.1        ggplot2_3.5.1       tidyverse_2.0.0    
[16] tinytable_0.3.0.10 

loaded via a namespace (and not attached):
 [1] gtable_0.3.5      xfun_0.44         websocket_1.4.1   processx_3.8.4   
 [5] lattice_0.22-6    tzdb_0.4.0        vctrs_0.6.5       tools_4.4.1      
 [9] ps_1.7.7          generics_0.1.3    curl_5.2.1        fansi_1.0.6      
[13] pkgconfig_2.0.3   Matrix_1.7-0      lifecycle_1.0.4   compiler_4.4.1   
[17] farver_2.1.2      munsell_0.5.1     chromote_0.2.0    htmltools_0.5.8.1
[21] yaml_2.3.8        later_1.3.2       pillar_1.9.0      openssl_2.2.0    
[25] nlme_3.1-164      tidyselect_1.2.1  digest_0.6.35     stringi_1.8.4    
[29] labeling_0.4.3    splines_4.4.1     fastmap_1.1.1     grid_4.4.1       
[33] colorspace_2.1-0  cli_3.6.3         magrittr_2.0.3    triebeard_0.4.1  
[37] utf8_1.2.4        withr_3.0.0       scales_1.3.0      promises_1.3.0   
[41] rappdirs_0.3.3    timechange_0.3.0  rmarkdown_2.26    httr_1.4.7       
[45] qpdf_1.3.3        askpass_1.2.0     hms_1.1.3         evaluate_0.23    
[49] knitr_1.46        mgcv_1.9-1        rlang_1.1.4       urltools_1.7.3   
[53] Rcpp_1.0.12       glue_1.7.0        selectr_0.4-2     xml2_1.3.6       
[57] rstudioapi_0.16.0 R6_2.5.1